IBM Applied Data Science - Capstone Project

Notebook used for submission of peer-graded assignment

First full ML project in python

Prediction of Car Accident Severity in Seattle, USA

Abstract

The purpose of this exploratory data analysis exercise is to assess whether, and how accurately, car accident severity in Seattle, USA can be predicted by means of supervised machine-learning models, exploiting collision records from past accidents that were recorded by the Seattle Police Department (SPD) and provided as open data by Traffic Records.

Being able to predict car accident severity from external factors like weather, location, and road conditions, as well as speeding, influence of alcohol/drugs, etc., would allow the government to put appropriate measures in place to reduce accident severity and, above all, allow the police and first-response teams to channel their resources and increase efficiency.

Using car accident track records from March 2013, three different machine-learning methods, namely K-Nearest Neighbours (KNN), Decision Trees, and Logistic Regression, were benchmarked against each other.

While the exploratory data analysis suggests that almost 90% of all accidents involving pedestrians and 88% of collisions involving cyclists lead to injuries (compared with 28% of accidents without pedestrians/cyclists), the tested machine-learning models generally struggled to correctly predict SEVERITYCODE=2 and therefore exhibited a high number of false negatives for this class. In real life, this could lead to a wrong allocation of police and first-responder resources and potentially end deadly.

The study was carried out in October 2020.


1. Introduction

Despite the fact that the US population has increased threefold since the beginning of the 20th century and the total number of cars passed the 280 million mark in 2019 (source), leading to roughly 3,000 billion miles travelled per year, traffic fatality rates continuously decline (Fig. 1). This decrease in car accident deaths is related to measures enforced by law (e.g. the seat belt law, 1968), advanced safety features (mandatory air bags, 1998), but also improved road safety (e.g. signs, traffic lights, etc.).

Moreover, thanks to technical advances in computer technology in the last decades, efficient measures can be taken even after an accident happened, e.g. by minimizing the response time of emergency teams and police through the smart analysis of accident records.

This coding exercise demonstrates how collision records can be analysed and provide insight into predicting car accident severity using the example of Seattle, USA.

Fig. 1: Development of annual car accident fatalities in the US. The graph shows traffic fatalities per billion vehicle miles traveled (red), per million people (orange), total annual deaths (light blue), vehicle miles travelled (VMT) in tens of billions (dark blue), and US population in millions (turquoise) (source).


2. Data Overview

The data used in this project was recorded by the Seattle Police Department (SPD) and provided as open data by Traffic Records. The spreadsheet can be downloaded free of charge from the Seattle GeoData Portal; records have been acquired since 2004.

The dataset used in this study comprises 194673 accidents, recorded and documented in Seattle in March 2013. Apart from the target variable, the severity of the collision SEVERITYCODE, 37 additional features are available. These comprise information about the location of the accident (e.g. X, Y, LOCATION, JUNCTIONTYPE, CROSSWALKKEY, SEGLANEKEY, HITPARKEDCAR, ADDRTYPE), external factors (e.g. LIGHTCOND, WEATHER, ROADCOND), or lawlessness (e.g. INATTENTIONIND, UNDERINFL, PEDROWNOTGRN, SPEEDING).

The severity of the collision is categorised as

  • 1: Accidents resulting in property damage
  • 2: Accidents resulting in injuries

For additional information about the individual attributes, kindly refer to the corresponding official documentation, which can be found here.

A first overview of the spatial distribution of car accidents in 2013 can be found below. Note the increased number of accidents at intersections (zoom in).

2.1 Libraries

In [116]:
%matplotlib inline


# import libraries
import os.path
from IPython.display import clear_output

import pandas as pd 
import numpy as np 
import random

import matplotlib.pyplot as plt
import seaborn as sns

import folium
from folium import plugins

import imblearn
from imblearn.over_sampling import SMOTE

from sklearn.model_selection import train_test_split

from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier 

from sklearn import metrics
from sklearn.metrics import f1_score
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import cross_val_score
from sklearn.metrics import roc_curve, auc

2.2 Data Loading

In [2]:
# Load the data and save data to pandas data frame
data_url = 'https://s3.us.cloud-object-storage.appdomain.cloud/cf-courses-data/CognitiveClass/DP0701EN/version-2/Data-Collisions.csv'
df = pd.read_csv(data_url)
/Users/martin/anaconda3/lib/python3.7/site-packages/IPython/core/interactiveshell.py:3020: DtypeWarning: Columns (33) have mixed types.Specify dtype option on import or set low_memory=False.
  interactivity=interactivity, compiler=compiler, result=result)
In [454]:
# Plot heatmap from accidents

loc = 'Density Map of Traffic Accidents, Seattle (March 2013)'
title_html = '''
             <h3 align="center" style="font-size:15px"><b>{}</b></h3>
             '''.format(loc)  

locations = df[['Y', 'X']][df['Y'].notna()]
heatmap = folium.Map(location=[47.605, -122.3], zoom_start=11, tiles='CartoDB dark_matter')

heatmap.add_child(folium.plugins.HeatMap(locations[['Y', 'X']].values, radius=8, cmap='viridis'))
heatmap.get_root().html.add_child(folium.Element(title_html))
heatmap
Out[454]:
[Interactive folium heat map of accident locations rendered here.]

3. Methodology

The data analysis workflow is designed as follows:

1. Data Loading:

  • Downloading the data from the online repository and storing into Pandas dataframe.

2. Data Overview:

  • Exploring the data dimensionality, format, quality, and volume.

3. Data Clean-up:

  • Re-formatting and encoding the data to make it fit for purpose.

4. Exploratory Data Analysis:

  • Getting insight into distribution, frequency, and correlation of features.
  • Pre-selection of input features for model training.

5. Model Building:

  • Normalizing the data
  • Benchmarking different models against each other (K-Nearest Neighbor, Decision Tree, Logistic Regression)

(6. Model Deployment and Evaluation)
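The benchmarking step (5) can be sketched as follows. This is a minimal illustration on synthetic, imbalanced toy data, not the notebook's actual pipeline; all parameters are placeholders:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score

# Synthetic stand-in for the cleaned accident data (~70/30 class split)
X, y = make_classification(n_samples=2000, n_features=8,
                           weights=[0.7, 0.3], random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
                                                    random_state=42)

# Normalize features; fit the scaler on the training split only to avoid leakage
scaler = StandardScaler().fit(X_train)
X_train, X_test = scaler.transform(X_train), scaler.transform(X_test)

# Benchmark the three candidate model families with the same metric
models = {
    "KNN": KNeighborsClassifier(n_neighbors=5),
    "Decision Tree": DecisionTreeClassifier(max_depth=6, random_state=42),
    "Logistic Regression": LogisticRegression(max_iter=1000),
}
for name, model in models.items():
    model.fit(X_train, y_train)
    print(f"{name}: F1 = {f1_score(y_test, model.predict(X_test)):.3f}")
```

The F1 score is used here because plain accuracy is misleading on imbalanced classes.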


4. Code

Note: The data has already been loaded above in order to display the heat map.

The libraries used can be found above.

Observations are documented in the code or below in blue boxes.

Key variables:

df: Data frame with raw data

df_c_fin: Data frame with cleaned-up data

*_INT: Factorised object-type feature in dataframe


4.1 Functions

In order to simplify parts of the data analysis, various functions were written; this was done mainly for practice.

In [4]:
# Count number of NaNs in dataframe

def df_qc(df):
    '''Count non-null ('val') and NaN ('nan') entries per column of a dataframe.'''

    missing_data = df.isnull()
    df_nan = pd.DataFrame(columns=['ID', 'val', 'nan'])

    for column in missing_data.columns.values.tolist():
        counts = missing_data[column].value_counts()
        val_ok = counts.get(False, 0)   # non-null entries
        val_nan = counts.get(True, 0)   # NaN entries

        df_new_row = pd.DataFrame(data=np.array([[column, val_ok, val_nan]]),
                                  columns=['ID', 'val', 'nan'])
        df_nan = pd.concat([df_nan, df_new_row], ignore_index=True)

    # cast values as integer
    df_nan = df_nan.astype({'val': 'int', 'nan': 'int'})

    df_nan.set_index('ID', inplace=True)
    df_nan.sort_values(['nan'], ascending=True, axis=0, inplace=True)

    return df_nan


# Function to plot histograms from features
def plt_hist_col(data_frame,column_name):
    '''
    Function to plot distribution of unique value counts of a SINGLE COLUMN in a data frame.
    
    INPUT:
    data_frame: Pandas data frame
    column_name: exact column name (STR)
    '''
    
    tot_len = len(data_frame[column_name])
    data_frame_perc = round((pd.Series(data_frame[column_name]).value_counts() / tot_len * 100),3)
    td = data_frame_perc.to_frame()

    ax1 = td.plot(kind="bar", figsize=(14,3), rot=90, width = 0.4)
    ax1.set_yticks(np.arange(0, 110, 10))
    ax1.set_title('Relative distribution of value counts for '+ str(column_name)+'  ('+str(tot_len)+' total)', fontsize=12)
    
    for p in ax1.patches:
        ax1.annotate(str(p.get_height()), (p.get_x() * 1.005, p.get_height() + 2), fontsize=10, color="r")
    plt.show()
    
    return(ax1)

        

    
# Interactively QC all column data types and output list with features, that shall be removed
def df_type_qc(data_frame_r, verbose=True):
    
    '''
    Function to interactively QC all column data types and cast them into other types if required.
    Returns a list del_list with column names to be deleted in next step.
    
    INPUT:
    data_frame_r: Pandas data frame
    verbose=True: Output of statistics True/False (BOOLEAN)
    OUTPUT:
    data_frame: QCed data frame
    del_list: List with column names to be deleted
    '''
    
    data_frame = data_frame_r.copy() # copy input df
    hist_max = 50 # number of maximum unique values to plot histogram
    del_list = [] # list with columns to be deleted
    i = 0
    cast = ""
    
    len_df = len(data_frame.columns) # number of columns
    missing_data = data_frame.isnull() # count missing data and create df with True/False
    
    # iterate through all columns and check data type
    # user input to cast into different data type
    
    while i < (len_df) and cast != "x": # total number of columns
        col_name = str(data_frame.columns[i]) #column name
        
        
        
        # calculate missing data per column
        counts = missing_data[col_name].value_counts()
        val_ok = counts.get(False, 0)   # non-null entries
        val_nan = counts.get(True, 0)   # NaN entries
        tot_count = len(data_frame[col_name])
 
        # if verbose=True output 
        if verbose:
            #clear_output()
            print(f"######################################")
            print(f"{i+1}/{len_df}: {col_name}")
            #print("----------STATS----------")
            print(f"\ttotal:\t{tot_count}")
            print(f"\tOK:\t{val_ok}\t({val_ok/tot_count*100}%)")
            print(f"\tNaN:\t{val_nan}\t({val_nan/tot_count*100}%)")
            
            print("----------HEAD----------")
            print(f"{data_frame[col_name].head()}\n")
            #random.randint(0,100)
            
            
            if len(data_frame[col_name].value_counts()) <= hist_max:              
                plt_hist_col(data_frame,col_name)
            
        print(f"x: quit\ni: cast to INT\nf: cast to FLT\no: cast to OBJ\nd: mark to DELETE (add to del_list)\nelse: skip")
        cast = input(">>")
              
        
        if cast == "i":
            data_frame[[col_name]] = data_frame[[col_name]].astype("int64", errors='ignore')
        elif cast == "f":
            data_frame[[col_name]] = data_frame[[col_name]].astype("float64", errors='ignore')
        elif cast == "o":
            data_frame[[col_name]] = data_frame[[col_name]].astype("object", errors='ignore')
        elif cast == "d":
            del_list.append(col_name)

        else:
            pass 
        
    
        i += 1

    return (data_frame,del_list) # return QCed df and list with column names to be deleted

4.2 Data Overview

In [5]:
# all raw data stored in dataframe df

print(f"Dimensions: {df.shape}")
df.info()
Dimensions: (194673, 38)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 194673 entries, 0 to 194672
Data columns (total 38 columns):
 #   Column          Non-Null Count   Dtype  
---  ------          --------------   -----  
 0   SEVERITYCODE    194673 non-null  int64  
 1   X               189339 non-null  float64
 2   Y               189339 non-null  float64
 3   OBJECTID        194673 non-null  int64  
 4   INCKEY          194673 non-null  int64  
 5   COLDETKEY       194673 non-null  int64  
 6   REPORTNO        194673 non-null  object 
 7   STATUS          194673 non-null  object 
 8   ADDRTYPE        192747 non-null  object 
 9   INTKEY          65070 non-null   float64
 10  LOCATION        191996 non-null  object 
 11  EXCEPTRSNCODE   84811 non-null   object 
 12  EXCEPTRSNDESC   5638 non-null    object 
 13  SEVERITYCODE.1  194673 non-null  int64  
 14  SEVERITYDESC    194673 non-null  object 
 15  COLLISIONTYPE   189769 non-null  object 
 16  PERSONCOUNT     194673 non-null  int64  
 17  PEDCOUNT        194673 non-null  int64  
 18  PEDCYLCOUNT     194673 non-null  int64  
 19  VEHCOUNT        194673 non-null  int64  
 20  INCDATE         194673 non-null  object 
 21  INCDTTM         194673 non-null  object 
 22  JUNCTIONTYPE    188344 non-null  object 
 23  SDOT_COLCODE    194673 non-null  int64  
 24  SDOT_COLDESC    194673 non-null  object 
 25  INATTENTIONIND  29805 non-null   object 
 26  UNDERINFL       189789 non-null  object 
 27  WEATHER         189592 non-null  object 
 28  ROADCOND        189661 non-null  object 
 29  LIGHTCOND       189503 non-null  object 
 30  PEDROWNOTGRNT   4667 non-null    object 
 31  SDOTCOLNUM      114936 non-null  float64
 32  SPEEDING        9333 non-null    object 
 33  ST_COLCODE      194655 non-null  object 
 34  ST_COLDESC      189769 non-null  object 
 35  SEGLANEKEY      194673 non-null  int64  
 36  CROSSWALKKEY    194673 non-null  int64  
 37  HITPARKEDCAR    194673 non-null  object 
dtypes: float64(4), int64(12), object(22)
memory usage: 56.4+ MB

The input data contains 194673 entries and is split into 38 features, of which SEVERITYCODE will be the target variable to predict.


In [6]:
# Distribution of the target value SEVERITYCODE

feature="SEVERITYCODE"
df[feature].value_counts().to_frame()
Out[6]:
SEVERITYCODE
1 136485
2 58188
In [7]:
# NaN values per feature
df_qc(df)
Out[7]:
val nan
ID
SEVERITYCODE 194673 0
SEGLANEKEY 194673 0
SDOT_COLDESC 194673 0
SDOT_COLCODE 194673 0
INCDTTM 194673 0
INCDATE 194673 0
VEHCOUNT 194673 0
CROSSWALKKEY 194673 0
PEDCOUNT 194673 0
PERSONCOUNT 194673 0
SEVERITYDESC 194673 0
SEVERITYCODE.1 194673 0
PEDCYLCOUNT 194673 0
HITPARKEDCAR 194673 0
OBJECTID 194673 0
COLDETKEY 194673 0
REPORTNO 194673 0
STATUS 194673 0
INCKEY 194673 0
ST_COLCODE 194655 18
ADDRTYPE 192747 1926
LOCATION 191996 2677
UNDERINFL 189789 4884
COLLISIONTYPE 189769 4904
ST_COLDESC 189769 4904
ROADCOND 189661 5012
WEATHER 189592 5081
LIGHTCOND 189503 5170
X 189339 5334
Y 189339 5334
JUNCTIONTYPE 188344 6329
SDOTCOLNUM 114936 79737
EXCEPTRSNCODE 84811 109862
INTKEY 65070 129603
INATTENTIONIND 29805 164868
SPEEDING 9333 185340
EXCEPTRSNDESC 5638 189035
PEDROWNOTGRNT 4667 190006

The table above lists the number of NaN values per feature. Whether a NaN means that the respective feature is False or indicates a missing value needs to be evaluated feature by feature.
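The same per-feature NaN tally produced by `df_qc` can be obtained more compactly with pandas built-ins; a minimal sketch on a small illustrative frame (column values are made up, not taken from the collision data):

```python
import numpy as np
import pandas as pd

# Small illustrative frame; the notebook applies this idea to the full dataframe
df_demo = pd.DataFrame({
    "SEVERITYCODE": [1, 2, 1, 2],
    "SPEEDING": ["Y", np.nan, np.nan, np.nan],
    "WEATHER": ["Clear", "Raining", np.nan, "Clear"],
})

# Non-null and NaN counts per column, sorted by number of NaNs
df_nan = pd.DataFrame({
    "val": df_demo.notna().sum(),
    "nan": df_demo.isna().sum(),
}).sort_values("nan")
print(df_nan)
```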

4.3 Data Clean-Up

In [8]:
# Extract year and month into new columns (vectorised over all rows)
df_c = df.copy()
df_c["YEAR"] = df_c["INCDATE"].str[0:4].astype(int)
df_c["MONTH"] = df_c["INCDATE"].str[5:7].astype(int)
df_c.drop(['INCDATE'], axis=1, inplace=True)
In [456]:
# Run interactive function to reformat columns
df_c_qc,del_list = df_type_qc(df_c)

The interactive script allows one to step efficiently through the individual features, calculate the NaN values and data type, and show the distribution of unique values (if below 50) in a histogram plot. User input is required to decide whether the data type of the respective column shall be cast into another format or whether the whole column shall be flagged for deletion in a subsequent step. This is done in case the entry is irrelevant for the model building (e.g. YEAR/MONTH, since the data comprises only one month, or ST_COLCODE/ST_COLDESC, which describe the accident) or if it contains too many NaN values.

In [24]:
# list with the flagged features that will be deleted
del_list = ["COLDETKEY", "REPORTNO", "STATUS", "INTKEY", "EXCEPTRSNCODE", "EXCEPTRSNDESC",
           "SEVERITYCODE.1","SEVERITYDESC","INCDTTM","PEDROWNOTGRNT","SDOTCOLNUM",
            "ST_COLCODE","ST_COLDESC","YEAR","MONTH"]
In [25]:
# delete columns based on del_list entries
df_c_qc = df_c.copy()
df_c_qc.drop(del_list, axis=1, inplace=True)

# reformat and encode individual column entries into integers for 
# simple math. Hot-encoding is done in a subsequent step
feature="ADDRTYPE"
df_c_qc = df_c_qc[df_c_qc[feature].notna()]

feature="INATTENTIONIND"
df_c_qc[feature] = df_c_qc[feature].replace(np.nan, "0")
df_c_qc[feature] = df_c_qc[feature].replace("Y", "1")
df_c_qc[feature] = df_c_qc[feature].astype("int64", errors='ignore')

feature="HITPARKEDCAR"
df_c_qc[feature] = df_c_qc[feature].replace(np.nan, "0")
df_c_qc[feature] = df_c_qc[feature].replace("Y", "1")
df_c_qc[feature] = df_c_qc[feature].replace("N", "0")
df_c_qc[feature] = df_c_qc[feature].astype("int64")

feature="UNDERINFL"
df_c_qc[feature] = df_c_qc[feature].replace(np.nan, "0")
df_c_qc[feature] = df_c_qc[feature].replace("N", "0")
df_c_qc[feature] = df_c_qc[feature].replace("Y", "1")
df_c_qc[feature] = df_c_qc[feature].astype("int64", errors='ignore')

feature="SPEEDING"
df_c_qc[feature] = df_c_qc[feature].replace("Y", "1")
df_c_qc[feature] = df_c_qc[feature].replace(np.nan, "0")
df_c_qc[feature] = df_c_qc[feature].astype("int64", errors='ignore')

feature="WEATHER"
df_c_qc[feature] = df_c_qc[feature].replace("Other", "Unknown")
df_c_qc[feature] = df_c_qc[feature].replace(np.nan, "Unknown")
df_c_qc[(feature)+'_INT'] = df_c_qc[feature].replace({'Clear': '1', 
                                                      'Partly Cloudy': '2', 
                                                      'Overcast': '3', 
                                                      'Severe Crosswind': '4', 
                                                      'Raining':'6',
                                                      'Blowing Sand/Dirt':'7',
                                                      'Sleet/Hail/Freezing Rain': '8',
                                                      'Fog/Smog/Smoke':'9',
                                                      'Snowing':'10',
                                                      'Unknown':'5',               
                                                     })
df_c_qc[(feature)+'_INT'] = df_c_qc[(feature)+'_INT'].astype("int64", errors='ignore')

feature="ROADCOND"
df_c_qc[feature] = df_c_qc[feature].replace("Other", "Unknown")
df_c_qc[feature] = df_c_qc[feature].replace(np.nan, "Unknown")
df_c_qc[(feature)+'_INT'] = df_c_qc[feature].replace({'Dry': '1', 
                                                      'Sand/Mud/Dirt': '2', 
                                                      'Wet': '3', 
                                                      'Unknown': '4', 
                                                      'Standing Water':'5',
                                                      'Snow/Slush':'6',
                                                      'Ice': '7',
                                                      'Oil':'8',              
                                                     })
df_c_qc[(feature)+'_INT'] = df_c_qc[(feature)+'_INT'].astype("int64", errors='ignore')

feature="LIGHTCOND"
df_c_qc[feature] = df_c_qc[feature].replace("Other", "Unknown")
df_c_qc[feature] = df_c_qc[feature].replace(np.nan, "Unknown")
df_c_qc[feature] = df_c_qc[feature].replace("Dark - No Street Lights", "Dark - Street Lights Off")
df_c_qc[(feature)+'_INT'] = df_c_qc[feature].replace({'Daylight': '1', 
                                                      'Dusk': '2', 
                                                      'Dawn': '3', 
                                                      'Dark - Street Lights On': '4',
                                                      'Unknown': '5' ,
                                                      'Dark - Unknown Lighting':'6',
                                                      'Dark - Street Lights Off':'7',        
                                                     })
df_c_qc[(feature)+'_INT'] = df_c_qc[(feature)+'_INT'].astype("int64")



feature="LOCATION"
df_c_qc[(feature)+'_INT'] = pd.factorize(df_c_qc[feature])[0] + 1

feature="JUNCTIONTYPE"
df_c_qc[feature] = df_c_qc[feature].replace(np.nan, "Unknown")
df_c_qc[(feature)+'_INT'] = df_c_qc[feature].replace({'Mid-Block (not related to intersection)': '1', 
                                                      'At Intersection (intersection related)': '2', 
                                                      'Mid-Block (but intersection related)': '3', 
                                                      'Unknown': '4',
                                                      'At Intersection (but not related to intersection)': '5' ,
                                                      'Ramp Junction':'6',
                                                      'Driveway Junction':'7',        
                                                     })
df_c_qc[(feature)+'_INT'] = df_c_qc[(feature)+'_INT'].astype("int64")

feature="ADDRTYPE"
#df_c_qc[(feature)+'_INT'] = pd.factorize(df_c_qc[feature])[0] + 1
df_c_qc[(feature)+'_INT'] = df_c_qc[feature].replace({'Block': '1', 
                                                      'Alley': '2', 
                                                      'Intersection': '3',       
                                                     })
df_c_qc[(feature)+'_INT'] = df_c_qc[(feature)+'_INT'].astype("int64")

#plt_hist_col(df_c_qc,'ADDRTYPE_INT')
#df_c_qc.info()
#df_c_qc['ADDRTYPE'].head()
#df['ADDRTYPE_INT'].value_counts().to_frame()
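The integer codes above (`*_INT`) impose an artificial ordering on nominal categories such as WEATHER. The one-hot encoding that the notebook defers to a subsequent step avoids this; a minimal sketch with `pd.get_dummies` (illustrative values only):

```python
import pandas as pd

weather = pd.DataFrame({"WEATHER": ["Clear", "Raining", "Overcast", "Clear"]})

# Each category becomes its own 0/1 indicator column, with no implied ordering
dummies = pd.get_dummies(weather["WEATHER"], prefix="WEATHER")
print(dummies.columns.tolist())
```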
In [26]:
plt_hist_col(df_c_qc,'WEATHER')
Out[26]:
<matplotlib.axes._subplots.AxesSubplot at 0x7fdd329eb7b8>

4.4 Exploratory Data Analysis

In this subsection, data correlations and feature relevance are being examined.

In [27]:
# Correlation matrix of cleaned-up dataframe
corr = df_c_qc.corr()

plt.figure(figsize = (16,6))
corr_plot = sns.heatmap(corr,xticklabels=corr.columns,yticklabels=corr.columns,cmap='RdBu_r', vmin=-1, vmax=1, annot=True)
corr_plot.set_title('Correlation Matrix', fontsize=12)
corr_plot
Out[27]:
<matplotlib.axes._subplots.AxesSubplot at 0x7fdd42b799b0>

A first look into the correlation matrix of the features from the cleaned-up dataframe reveals:

  • As expected, X/Y have a low correlation with the other features and can therefore be discarded.
  • Positive correlation between the involvement of pedestrians (PEDCOUNT/CROSSWALKKEY/SEGLANEKEY) or cyclists (PEDCYLCOUNT) and severity SEVERITYCODE.
  • A high positive correlation between WEATHER_INT and ROADCOND_INT.
  • A mild correlation between UNDERINFL and LIGHTCOND_INT, suggesting drunk driving occurs more often at night.

Moreover, features like OBJECTID, COLLISIONTYPE, SDOT_COLCODE, and SDOT_COLDESC are assigned by the police after the accident and will therefore be dropped.

In [28]:
# clean-up final dataframe
df_c_fin = df_c_qc.drop(['X','Y','OBJECTID',
                         'SDOT_COLCODE', 'SDOT_COLDESC', 'ADDRTYPE', 'LOCATION', 
                         'COLLISIONTYPE'], axis=1)
In [29]:
feature="SEVERITYCODE"
s2_values = round(df_c_fin[feature].value_counts()[2] / (df_c_fin[feature].value_counts()[1]+df_c_fin[feature].value_counts()[2])*100,2)
print(f"Relative amount of type 2 (injuries) accidents of total dataset: {s2_values}%.")
Relative amount of type 2 (injuries) accidents of total dataset: 30.09%.

Analysis of the relative distribution of accidents leading to property damage only (SEVERITYCODE=1) and accidents resulting in injuries (SEVERITYCODE=2) shows that, at 30%, the latter is underrepresented in the imbalanced input dataset. Consequently, all resulting models will be biased towards the former outcome. An attempt to overcome this issue is made through Synthetic Minority Oversampling (SMOTE) below.

In [30]:
# Impact of features on severitycode #1: traffic offenses

target = "SEVERITYCODE"
features_law = ["UNDERINFL", "SPEEDING", "INATTENTIONIND"]

fig = plt.figure(figsize=(20, 8)) # create figure
fig.suptitle("Severitycode vs Traffic Offenses", fontsize='20')
ax1 = fig.add_subplot(131)
ax2 = fig.add_subplot(132)
ax3 = fig.add_subplot(133)

#df_c_fin.plot(kind='bar', ax=ax1)
sns.catplot(x=features_law[0], kind='count', hue=target, data=df_c_fin, ax=ax1)
sns.catplot(x=features_law[1], kind='count', hue=target, data=df_c_fin, ax=ax2)
sns.catplot(x=features_law[2], kind='count', hue=target, data=df_c_fin, ax=ax3)
ax1.legend(loc='upper right',title=target)
ax2.legend(loc='upper right',title=target)
ax3.legend(loc='upper right',title=target)

plt.close(2) 
plt.close(3)
plt.close(4)
In [31]:
for i,f in enumerate(features_law):
    
    sev1Y = df_c_fin[(df_c_fin["SEVERITYCODE"] == 1) & (df_c_fin[f] == 1)].shape[0]
    sev1N = df_c_fin[(df_c_fin["SEVERITYCODE"] == 1) & (df_c_fin[f] == 0)].shape[0]
    
    sev2Y = df_c_fin[(df_c_fin["SEVERITYCODE"] == 2) & (df_c_fin[f] == 1)].shape[0]
    sev2N = df_c_fin[(df_c_fin["SEVERITYCODE"] == 2) & (df_c_fin[f] == 0)].shape[0]
    
    print(f"{f}:\n  {round(sev2Y / (sev1Y+sev2Y) * 100,2)}% of accidents with {f}=1 have {target}=2.")
    print(f"  {round(sev2N / (sev1N+sev2N) * 100,2)}% of accidents with {f}=0 have {target}=2.")
UNDERINFL:
  39.06% of accidents with UNDERINFL=1 have SEVERITYCODE=2.
  29.65% of accidents with UNDERINFL=0 have SEVERITYCODE=2.
SPEEDING:
  37.87% of accidents with SPEEDING=1 have SEVERITYCODE=2.
  29.7% of accidents with SPEEDING=0 have SEVERITYCODE=2.
INATTENTIONIND:
  34.97% of accidents with INATTENTIONIND=1 have SEVERITYCODE=2.
  29.2% of accidents with INATTENTIONIND=0 have SEVERITYCODE=2.

Analysis of UNDERINFL, SPEEDING, and INATTENTIONIND vs SEVERITYCODE reveals that traffic offenses increase the share of accidents resulting in injuries (SEVERITYCODE=2) by up to roughly 10 percentage points (e.g. 39.06% with UNDERINFL=1 vs 29.65% without).
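The per-feature injury shares computed by the loop above can also be obtained directly with a row-normalised `pd.crosstab`; a minimal sketch on toy data (values illustrative, not from the collision dataset):

```python
import pandas as pd

df_demo = pd.DataFrame({
    "SEVERITYCODE": [1, 1, 1, 2, 2, 1, 2],
    "SPEEDING":     [0, 0, 1, 1, 0, 1, 1],
})

# Share of each SEVERITYCODE within SPEEDING=0 and SPEEDING=1
rates = pd.crosstab(df_demo["SPEEDING"], df_demo["SEVERITYCODE"],
                    normalize="index")
print(rates)
```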

In [32]:
# Impact of features on severitycode #2: pedestrians / cyclists

target = "SEVERITYCODE"
features_ped = ["PEDCOUNT", "PEDCYLCOUNT"]

fig = plt.figure(figsize=(20, 8)) # create figure
fig.suptitle("Severitycode vs Pedestrians / Cyclists", fontsize='20')
ax1 = fig.add_subplot(121)
ax2 = fig.add_subplot(122)

sns.catplot(x=features_ped[0], kind='count', hue=target, data=df_c_fin, ax=ax1)
sns.catplot(x=features_ped[1], kind='count', hue=target, data=df_c_fin, ax=ax2)
ax1.legend(loc='upper right',title=target)
ax2.legend(loc='upper right',title=target)

plt.close(2) 
plt.close(3)
In [33]:
for i,f in enumerate(features_ped):
    
    sev1N = df_c_fin[(df_c_fin["SEVERITYCODE"] == 1) & (df_c_fin[f] == 0)].shape[0]
    sev1Y = df_c_fin[(df_c_fin["SEVERITYCODE"] == 1) & (df_c_fin[f] == 1)].shape[0]
    
    sev2N = df_c_fin[(df_c_fin["SEVERITYCODE"] == 2) & (df_c_fin[f] == 0)].shape[0]
    sev2Y = df_c_fin[(df_c_fin["SEVERITYCODE"] == 2) & (df_c_fin[f] == 1)].shape[0]
    
    print(f"{f}:\n  {round(sev2Y / (sev1Y+sev2Y) * 100,2)}% of accidents with {f}>=1 have {target}=2.")
    print(f"  {round(sev2N / (sev1N+sev2N) * 100,2)}% of accidents with {f}=0 have {target}=2.")
PEDCOUNT:
  89.85% of accidents with PEDCOUNT>=1 have SEVERITYCODE=2.
  27.86% of accidents with PEDCOUNT=0 have SEVERITYCODE=2.
PEDCYLCOUNT:
  87.53% of accidents with PEDCYLCOUNT>=1 have SEVERITYCODE=2.
  28.41% of accidents with PEDCYLCOUNT=0 have SEVERITYCODE=2.

Analysis of PEDCOUNT and PEDCYLCOUNT vs SEVERITYCODE reveals that the involvement of pedestrians or cyclists raises the share of accidents resulting in injuries (SEVERITYCODE=2) from roughly 28% to almost 90%.

In [34]:
# Impact of features on severitycode #3: weather, daytime, road condition

target = "SEVERITYCODE"
features_ext = ["WEATHER", "ROADCOND", "LIGHTCOND"]

fig = plt.figure(figsize=(20, 8))

ax1 = sns.catplot(x=features_ext[0], kind='count', hue=target, data=df_c_fin, height=4, aspect=5)
ax1.fig.suptitle(('Impact of '+features_ext[0]+' on '+target),size=20)

ax2 = sns.catplot(x=features_ext[1], kind='count', hue=target, data=df_c_fin, height=4, aspect=5)
ax2.fig.suptitle(('Impact of '+features_ext[1]+' on '+target),size=20)

ax3 = sns.catplot(x=features_ext[2], kind='count', hue=target, data=df_c_fin, height=4, aspect=5)
ax3.fig.suptitle(('Impact of '+features_ext[2]+' on '+target),size=20)

plt.show()
<Figure size 1440x576 with 0 Axes>
In [35]:
# Playing around with dictionary for practice here
#initialise dictionary to count distribution of feature values per SEVERITYCODE
sev_dict = {1:[0,0],2:[0,0],3:[0,0],4:[0,0],5:[0,0],6:[0,0],7:[0,0],8:[0,0],9:[0,0],10:[0,0]}

features_ext_int = ["WEATHER_INT", "ROADCOND_INT", "LIGHTCOND_INT"]

for feature_name in features_ext_int:
    print(feature_name)
    #outer loop over severity code (1/2)
    for sevcode in range(1,3,1):
        feature_length = df_c_fin[feature_name].nunique() #count number of unique features

        for feature_code in range(1,feature_length+1,1):
            sev_dict[feature_code][sevcode-1] = df_c_fin[(df_c_fin["SEVERITYCODE"] == sevcode) & (df_c_fin[feature_name] == feature_code)].shape[0]


            total = sev_dict[feature_code][0]+sev_dict[feature_code][1]
            if sevcode == 2:
                print(f"  {round(sev_dict[feature_code][1] / (total) * 100,2)}% of accidents with feature code = {feature_code} have {target}=2. ({total} counts)")
        
WEATHER_INT
  32.32% of accidents with feature code = 1 have SEVERITYCODE=2. (110626 counts)
  60.0% of accidents with feature code = 2 have SEVERITYCODE=2. (5 counts)
  31.62% of accidents with feature code = 3 have SEVERITYCODE=2. (27584 counts)
  28.0% of accidents with feature code = 4 have SEVERITYCODE=2. (25 counts)
  9.91% of accidents with feature code = 5 have SEVERITYCODE=2. (19876 counts)
  33.78% of accidents with feature code = 6 have SEVERITYCODE=2. (33004 counts)
  26.0% of accidents with feature code = 7 have SEVERITYCODE=2. (50 counts)
  24.11% of accidents with feature code = 8 have SEVERITYCODE=2. (112 counts)
  33.04% of accidents with feature code = 9 have SEVERITYCODE=2. (563 counts)
  18.74% of accidents with feature code = 10 have SEVERITYCODE=2. (902 counts)
ROADCOND_INT
  32.24% of accidents with feature code = 1 have SEVERITYCODE=2. (123945 counts)
  29.73% of accidents with feature code = 2 have SEVERITYCODE=2. (74 counts)
  33.23% of accidents with feature code = 3 have SEVERITYCODE=2. (47279 counts)
  9.5% of accidents with feature code = 4 have SEVERITYCODE=2. (19081 counts)
  26.13% of accidents with feature code = 5 have SEVERITYCODE=2. (111 counts)
  16.65% of accidents with feature code = 6 have SEVERITYCODE=2. (997 counts)
  22.58% of accidents with feature code = 7 have SEVERITYCODE=2. (1196 counts)
  37.5% of accidents with feature code = 8 have SEVERITYCODE=2. (64 counts)
LIGHTCOND_INT
  33.28% of accidents with feature code = 1 have SEVERITYCODE=2. (115468 counts)
  33.16% of accidents with feature code = 2 have SEVERITYCODE=2. (5856 counts)
  33.04% of accidents with feature code = 3 have SEVERITYCODE=2. (2491 counts)
  29.9% of accidents with feature code = 4 have SEVERITYCODE=2. (48301 counts)
  9.55% of accidents with feature code = 5 have SEVERITYCODE=2. (17901 counts)
  36.36% of accidents with feature code = 6 have SEVERITYCODE=2. (11 counts)
  23.87% of accidents with feature code = 7 have SEVERITYCODE=2. (2719 counts)

Analysis of external features like WEATHER, ROADCOND, or LIGHTCOND reveals that:

  • Most accidents resulting in injuries occur under partly cloudy WEATHER conditions. However, this result is statistically not stable, as only 5 accidents occurred under this condition. Other weather conditions resulting in injuries are rainy (33.78%), fog/smog/smoke (33.04%), and clear (32.32%). While it is plausible that reduced visibility and aquaplaning may increase accident numbers, the high injury rate during clear weather seems counterintuitive at first glance. However, considering that the involvement of pedestrians/cyclists in accidents increases the risk of injury by around 70%, and assuming that they are more likely to be on the streets when the weather is good, this is less surprising.
  • Oily road conditions (ROADCOND) bear the highest risk of leading to accidents with injuries (37.5%). With only 64 counts, however, this statement is statistically not stable. Other road conditions leading to accidents with injuries (wet, 33.23% and dry, 32.24%) correlate with the weather conditions.
  • Analysis of the LIGHTCOND feature shows that around 1/3 of all accidents happening during daylight, dusk, or dawn lead to injuries. Injuries are less likely to happen during night time due to the reduced number of pedestrians/cyclists on the street. The suspiciously low value of 9.55% corresponds to the "Unknown" category.
In [36]:
# Impact of features on severitycode #3: location

#initialise dictionary to count distribution of feature values per SEVERITYCODE
sev_dict = {1:[0,0],2:[0,0],3:[0,0],4:[0,0],5:[0,0],6:[0,0],7:[0,0],8:[0,0],9:[0,0],10:[0,0]}

features_loc = ["JUNCTIONTYPE_INT", "ADDRTYPE_INT"]

for feature_name in features_loc:
    print(feature_name)
    #outer loop over severity code (1/2)
    for sevcode in range(1,3,1):
        feature_length = df_c_fin[feature_name].nunique() #count number of unique features

        for feature_code in range(1,feature_length+1,1):
            sev_dict[feature_code][sevcode-1] = df_c_fin[(df_c_fin["SEVERITYCODE"] == sevcode) & (df_c_fin[feature_name] == feature_code)].shape[0]


            total = sev_dict[feature_code][0]+sev_dict[feature_code][1]
            if sevcode == 2:
                print(f"  {round(sev_dict[feature_code][1] / (total) * 100,2)}% of accidents with feature code = {feature_code} have {target}=2. ({total} counts)")
        
JUNCTIONTYPE_INT
  21.64% of accidents with feature code = 1 have SEVERITYCODE=2. (89517 counts)
  43.27% of accidents with feature code = 2 have SEVERITYCODE=2. (62786 counts)
  32.03% of accidents with feature code = 3 have SEVERITYCODE=2. (22775 counts)
  5.38% of accidents with feature code = 4 have SEVERITYCODE=2. (4739 counts)
  29.72% of accidents with feature code = 5 have SEVERITYCODE=2. (2096 counts)
  31.71% of accidents with feature code = 6 have SEVERITYCODE=2. (164 counts)
  30.3% of accidents with feature code = 7 have SEVERITYCODE=2. (10670 counts)
ADDRTYPE_INT
  23.71% of accidents with feature code = 1 have SEVERITYCODE=2. (126926 counts)
  10.92% of accidents with feature code = 2 have SEVERITYCODE=2. (751 counts)
  42.75% of accidents with feature code = 3 have SEVERITYCODE=2. (65070 counts)

Analysis of the data features JUNCTIONTYPE_INT and ADDRTYPE_INT, which describe the location of the collision, shows that most accidents leading to injuries are intersection-related (43.27% and 42.75%, respectively). The two features contain similar information; due to its higher granularity, JUNCTIONTYPE_INT will be kept.

4.5 Model Building

This subsection is split into two parts: 1) augmenting and encoding data and 2) model testing.

4.5.1 Augmenting and Encoding Data

One-Hot Encoder

One-hot encoding is a data-processing step applied to categorical data to convert it into a binary vector representation for use in machine learning algorithms.

This step is required because many ML techniques would otherwise assume a natural ordering between integer-encoded categories, which results in poor model performance. For the modelling, the following features will be one-hot encoded: JUNCTIONTYPE, LIGHTCOND, WEATHER, ROADCOND. More information can be found here.
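As a minimal, self-contained illustration (independent of pandas and of the notebook's dataframe; the category values below are made up, not taken from the dataset), one-hot encoding maps each category value to a binary indicator vector:

```python
# Minimal one-hot encoding sketch in plain Python.
# The weather values are illustrative only.
def one_hot(values):
    categories = sorted(set(values))  # fixed, reproducible column order
    return [[1 if v == c else 0 for c in categories] for v in values]

weather = ["Clear", "Raining", "Clear", "Snowing"]
encoded = one_hot(weather)
# Columns: ["Clear", "Raining", "Snowing"]
# "Clear" -> [1, 0, 0], "Raining" -> [0, 1, 0], "Snowing" -> [0, 0, 1]
```

`pd.get_dummies`, used in the cell below, produces the same indicator columns directly on a dataframe and prefixes them with the feature name.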

In [37]:
# One-Hot encode categorical data of cleaned-up dataframe df_c_fin
df_weather = pd.get_dummies(df_c_fin.WEATHER, prefix="WEATHER")
df_roadcond = pd.get_dummies(df_c_fin.ROADCOND, prefix="ROADCOND")
df_junction = pd.get_dummies(df_c_fin.JUNCTIONTYPE, prefix="JUNCTIONTYPE")
df_light = pd.get_dummies(df_c_fin.LIGHTCOND, prefix="LIGHTCOND") 

df_1hot = pd.concat([df_light, df_junction, df_roadcond, df_weather], axis=1)

#create list
features_1hot = list(df_1hot.columns)
In [38]:
# Merge dataframes and define feature list
df_fin = pd.concat([df_c_fin, df_1hot], axis=1)
In [45]:
# create list with encoded features to be used for model building
target = "SEVERITYCODE"

features_loc = ["JUNCTIONTYPE_INT"]
features_ext = ["LIGHTCOND_INT", "WEATHER_INT", "ROADCOND_INT"]
features_ped = ["PEDCOUNT", "PEDCYLCOUNT", "VEHCOUNT"]
features_law = ["SPEEDING", "UNDERINFL", "INATTENTIONIND"]

features_final = features_law + features_1hot

Split Dataset

In [186]:
# Split data set into training and testing data
split_ratio = 0.4

X = df_fin[features_final]
Y = df_fin[target]

x_train, x_test, y_train, y_test = train_test_split(X,Y,test_size=split_ratio,random_state=0)

print(f"----TRAINING / TESTING DATA CREATED----")
print(f"Training Data: {x_train.shape},{y_train.shape}")
print(f"Testing Data: {x_test.shape},{y_test.shape}")
print(f"Ratio: {split_ratio}")
----TRAINING / TESTING DATA CREATED----
Training Data: (115648, 35),(115648,)
Testing Data: (77099, 35),(77099,)
Ratio: 0.4

Synthetic Minority Oversampling Technique (SMOTE)

Method for oversampling examples of an underrepresented class in order to overcome bias from imbalanced classification. More information can be found here
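The core interpolation step of SMOTE can be sketched in a few lines: a synthetic sample is placed at a random position on the line segment between a minority-class point and one of its minority-class neighbours. This is a simplified illustration of the idea only, not the imbalanced-learn implementation:

```python
import random

def smote_sample(x, neighbor, rng=random):
    """Create one synthetic minority sample between x and a neighbour.
    Simplified sketch of the SMOTE interpolation step only."""
    gap = rng.uniform(0, 1)  # random position along the segment
    return [xi + gap * (ni - xi) for xi, ni in zip(x, neighbor)]

rng = random.Random(0)  # seeded for reproducibility
synthetic = smote_sample([1.0, 2.0], [3.0, 4.0], rng)
# `synthetic` lies on the segment between the two input points
```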

In [187]:
os = SMOTE(random_state=1)
x_train_SMOTE, y_train_SMOTE = os.fit_resample(x_train, y_train) # fit_sample was renamed to fit_resample in imbalanced-learn 0.4
print(f"----TRAINING DATA AUGMENTED for {target}----")
print(f"Relative Distribution for {target}\nInput Training Data x_train / y_train")
print(y_train.value_counts()/len(y_train))
print('')
print(f"Relative Distribution for {target}\nAugmented Training Data x_train_SMOTE / y_train_SMOTE")
print(pd.Series(y_train_SMOTE).value_counts()/len(y_train_SMOTE))
----TRAINING DATA AUGMENTED for SEVERITYCODE----
Relative Distribution for SEVERITYCODE
Input Training Data x_train / y_train
1    0.697643
2    0.302357
Name: SEVERITYCODE, dtype: float64

Relative Distribution for SEVERITYCODE
Augmented Training Data x_train_SMOTE / y_train_SMOTE
2    0.5
1    0.5
Name: SEVERITYCODE, dtype: float64

4.5.2 Model Testing

4.5.2.1 K-Nearest Neighbours (KNN)

KNN is a non-parametric, supervised machine learning algorithm used for classification and regression. In both cases, the input consists of the k closest training examples in the feature space. The output of the model is a class membership.

An object is classified by a plurality vote of its neighbours, with the object being assigned to the class most common among its k nearest neighbours (k is a positive integer, typically small). If k = 1, the object is simply assigned to the class of that single nearest neighbour. For k > 1, the label that is most frequent among the k training samples nearest to the query point is assigned to the query/test point.

Since this algorithm relies on distance for classification, normalisation of input data is important (reference).

Note that KNNs have a high prediction cost for high-dimensional data, due to calculating the Euclidean distance in the multi-dimensional feature space.

Fig. 2: Example of k-NN classification. The test sample (green dot) should be classified either to blue squares or to red triangles. If k = 3 (solid line circle) it is assigned to the red triangles because there are 2 triangles and only 1 square inside the inner circle. If k = 5 (dashed line circle) it is assigned to the blue squares (3 squares vs. 2 triangles inside the outer circle). (source).
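The voting scheme from Fig. 2 can be reproduced in a few lines of plain Python. The point coordinates below are made up to mirror the figure (two close red triangles, three slightly farther blue squares), not taken from the data:

```python
from collections import Counter
from math import dist  # Euclidean distance, Python >= 3.8

def knn_predict(query, points, k):
    """Classify `query` by majority vote among its k nearest labelled points."""
    nearest = sorted(points, key=lambda p: dist(query, p[0]))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

# Toy layout mirroring Fig. 2 (coordinates are illustrative)
points = [((0.5, 0.0), "red"),  ((0.0, 0.6), "red"),  ((2.0, 2.0), "red"),
          ((1.0, 0.0), "blue"), ((0.0, 1.1), "blue"), ((1.0, 1.0), "blue")]

knn_predict((0.0, 0.0), points, k=3)  # -> "red"  (2 red vs 1 blue)
knn_predict((0.0, 0.0), points, k=5)  # -> "blue" (3 blue vs 2 red)
```

Changing k from 3 to 5 flips the prediction, which is exactly why the weighted F1 score is tracked over a range of K values in the cells below.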

In [238]:
# KNN classifier for unbalanced, encoded input dataset x_train, y_train, x_test, y_test

Ks = 8 # maximum number of neighbours (K) to test
conf_mat = []
report = []
Yhat = []
f1_weight_avg = []

for n in range(1,Ks+1):
    classifier = KNeighborsClassifier(n_neighbors = n).fit(x_train, y_train)
    Yhat.append(classifier.predict(x_test))
    
    conf_mat.append(confusion_matrix(y_test, Yhat[n-1]))
    report.append(classification_report(y_test, Yhat[n-1]))

    f1_weight_avg.append(f1_score(y_test, Yhat[n-1], average='weighted'))


# get results for the maximum weighted F1 score:
Ks_best = np.argmax(f1_weight_avg)+1
conf_mat_best = conf_mat[Ks_best-1]
report_best = report[Ks_best-1]
Yhat_best = Yhat[Ks_best-1]
In [243]:
print(report_best)
              precision    recall  f1-score   support

           1       0.74      0.80      0.77     54069
           2       0.42      0.35      0.38     23030

    accuracy                           0.66     77099
   macro avg       0.58      0.57      0.57     77099
weighted avg       0.65      0.66      0.65     77099

In [282]:
fig = plt.figure(figsize=(20, 8)) # create figure
fig.suptitle("KNN Model QC", fontsize='20')
ax1 = fig.add_subplot(121)
ax2 = fig.add_subplot(122)


#Qualitative assessment of the quality of the model by comparing the value distribution 
# of the real data with the predicted values.

ax1 = sns.distplot(Y, hist=True, color="r", label="Actual Value", ax=ax1)
ax1 = sns.distplot(Yhat_best, hist=True, color="b", label="Fitted Values", ax=ax1)

ax1.set_ylabel("Observed data")
ax1.set_xlabel("SEVERITYCODE")
ax1.set_title('Distribution plot | KNN: Unbalanced data K='+str(Ks_best))
ax1.legend()


# Weighted F1 score development with K

ax2 = plt.plot(range(1,Ks+1),f1_weight_avg)

plt.title('Weighted F1 score vs K value | KNN: Unbalanced data K='+str(Ks_best))
plt.ylabel("F1 weighted avg")
plt.xlabel("K")
plt.show()

The classification report shows the main classification metrics precision, recall and f1-score on a per-class basis. The metrics are calculated by using true and false positives, true and false negatives. Positive and negative in this case are generic names for the predicted classes:

1) Precision: Accuracy of positive predictions. Precision = True Positives/(True Positives + False Positives)

2) Recall: Ability of a classifier to find all positive instances. For each class it is defined as the ratio of true positives to the sum of true positives and false negatives (Fraction of positives that were correctly identified). Recall = True Positives / (True Positives + False Negatives)

3) F1 score: This is a weighted harmonic mean of precision and recall such that the best score is 1.0 and the worst is 0.0. F1 Score = 2*(Recall * Precision) / (Recall + Precision)

4) Support is the number of occurrences of each class in the correct target values.

5) Accuracy: Fraction of predictions the model got right. Accuracy = Number of correct predictions / Total number of predictions

6) Macro F1 calculates the F1 per class but aggregates without weights, which results in a bigger penalisation when the model does not perform well on the minority classes.

7) Weighted F1 score calculates the F1 score for each class independently. When it adds F1 scores per class together, it uses a weight that depends on the number of true labels of each class.

More info can be found here.
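These definitions can be verified on a small, made-up set of counts (the tp/fp/fn values below are illustrative; only the per-class F1 scores and supports in the last two lines are taken from the KNN report above):

```python
# Toy binary counts (illustrative) for one class treated as "positive"
tp, fp, fn = 8, 2, 4

precision = tp / (tp + fp)                          # 8 / 10 = 0.8
recall = tp / (tp + fn)                             # 8 / 12 ~ 0.667
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean

# Weighted F1: per-class F1 weighted by support (values from the report above)
f1_per_class, support = [0.77, 0.38], [54069, 23030]
weighted_f1 = sum(f * s for f, s in zip(f1_per_class, support)) / sum(support)
```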

In [246]:
# KNN classifier for balanced, encoded input dataset x_train_SMOTE, y_train_SMOTE, x_test, y_test

Ks = 8 # maximum number of neighbours (K) to test
conf_mat_SMOTE = []
report_SMOTE = []
Yhat_SMOTE = []
f1_weight_avg_SMOTE = []

for n in range(1,Ks+1):
    classifier_SMOTE = KNeighborsClassifier(n_neighbors = n).fit(x_train_SMOTE, y_train_SMOTE)
    Yhat_SMOTE.append(classifier_SMOTE.predict(x_test))
    
    conf_mat_SMOTE.append(confusion_matrix(y_test, Yhat_SMOTE[n-1]))
    report_SMOTE.append(classification_report(y_test, Yhat_SMOTE[n-1]))

    f1_weight_avg_SMOTE.append(f1_score(y_test, Yhat_SMOTE[n-1], average='weighted'))


# get results for the maximum weighted F1 score:
Ks_best_SMOTE = np.argmax(f1_weight_avg_SMOTE)+1
conf_mat_best_SMOTE = conf_mat_SMOTE[Ks_best_SMOTE-1]
report_best_SMOTE = report_SMOTE[Ks_best_SMOTE-1]
Yhat_best_SMOTE = Yhat_SMOTE[Ks_best_SMOTE-1]
In [247]:
print(report_best_SMOTE)
              precision    recall  f1-score   support

           1       0.73      0.83      0.78     54069
           2       0.42      0.29      0.34     23030

    accuracy                           0.67     77099
   macro avg       0.58      0.56      0.56     77099
weighted avg       0.64      0.67      0.65     77099

In [281]:
fig = plt.figure(figsize=(20, 8)) # create figure
fig.suptitle("KNN Model QC SMOTE", fontsize='20')
ax1 = fig.add_subplot(121)
ax2 = fig.add_subplot(122)


#Qualitative assessment of the quality of the model by comparing the value distribution 
# of the real data with the predicted values.

ax1 = sns.distplot(Y, hist=True, color="r", label="Actual Value", ax=ax1)
ax1 = sns.distplot(Yhat_best_SMOTE, hist=True, color="b", label="Fitted Values", ax=ax1)

ax1.set_ylabel("Observed data")
ax1.set_xlabel("SEVERITYCODE")
ax1.set_title('Distribution plot | KNN: SMOTE-Balanced data K='+str(Ks_best_SMOTE))
ax1.legend()


# Weighted F1 score development with K

ax2 = plt.plot(range(1,Ks+1),f1_weight_avg_SMOTE)

plt.title('Weighted F1 score vs K value | KNN: SMOTE-Balanced data K='+str(Ks_best_SMOTE))
plt.ylabel("F1 weighted avg")
plt.xlabel("K")
plt.show()

The high precision and recall values for class 1 suggest that the model is good at predicting SEVERITYCODE=1 accidents. The low values for SEVERITYCODE=2, however, suggest that most collisions resulting in injuries will be predicted incorrectly.

4.5.2.2 Decision Tree

A decision tree is a flowchart-like tree structure in which each internal node represents a feature (or attribute), each branch represents a decision rule, and each leaf node represents the outcome. The topmost node in a decision tree is known as the root node. The tree learns to partition the data on the basis of attribute values and splits it in a recursive manner, called recursive partitioning. This flowchart-like structure supports decision making.

Its visualisation resembles a flowchart diagram and mimics human-level thinking, which is why decision trees are easy to understand and interpret. More information can be found here.

Algorithm

  • Select the best attribute using an Attribute Selection Measure (ASM) to split the records.
  • Make that attribute a decision node and break the dataset into smaller subsets.
  • Build the tree by repeating this process recursively for each child until one of the following conditions is met:
    • All the tuples belong to the same attribute value.
    • There are no remaining attributes.
    • There are no remaining instances.
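The "entropy" criterion used in the classifier below rests on one such attribute selection measure: information gain, i.e. the reduction in Shannon entropy achieved by a split. A minimal sketch with made-up labels:

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy of a label list; basis of the 'entropy' criterion."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(labels, left, right):
    """Entropy reduction achieved by splitting `labels` into two subsets."""
    n = len(labels)
    return (entropy(labels)
            - (len(left) / n) * entropy(left)
            - (len(right) / n) * entropy(right))

labels = [1, 1, 2, 2]                            # perfectly mixed classes
gain = information_gain(labels, [1, 1], [2, 2])  # a perfect split
```

At each node the tree picks the split with the highest gain, which is why pure leaves (entropy 0) end the recursion.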

In [415]:
# AUC is a good evaluation metric for this type of problem.

max_depths = np.linspace(3, 21, 19, endpoint=True)


report_tree = []
f1_weight_avg_tree = []

train_results = []
test_results = []
Yhat_tree_list = []


for n in max_depths:
    
    n=int(n)

    clf = DecisionTreeClassifier(max_depth=n, criterion="entropy",
                                 max_features=None, max_leaf_nodes=None,
                                 min_samples_leaf=1,
                                 min_samples_split=2, min_weight_fraction_leaf=0.0,
                                 random_state=None, splitter="best")

    clf = clf.fit(x_train,y_train)
    
    
    Yhat_tree = clf.predict(x_train)
    
    false_positive_rate, true_positive_rate, thresholds = roc_curve(y_train-1, Yhat_tree-1)
    roc_auc = auc(false_positive_rate, true_positive_rate)
    train_results.append(roc_auc)
    
    
    Yhat_tree_test = clf.predict(x_test)
    Yhat_tree_list.append(Yhat_tree_test)
    
    false_positive_rate, true_positive_rate, thresholds = roc_curve(y_test-1, Yhat_tree_test-1)
    roc_auc = auc(false_positive_rate, true_positive_rate)
    test_results.append(roc_auc)
    
    report_tree.append(classification_report(y_test, Yhat_tree_test))
    f1_weight_avg_tree.append(f1_score(y_test, Yhat_tree_test, average='weighted'))
    

test_results_best_tree = np.argmax(test_results)+1 #iteration number (= list index + 1)
depth_tree_max = max_depths[test_results_best_tree-1] # tree depth at the maximum test AUC
report_best_tree = report_tree[test_results_best_tree-1]
Yhat_tree_test_best = Yhat_tree_list[test_results_best_tree-1]

print(depth_tree_max)
print(report_best_tree)
17.0
              precision    recall  f1-score   support

           1       0.70      0.98      0.82     54069
           2       0.45      0.03      0.06     23030

    accuracy                           0.70     77099
   macro avg       0.58      0.51      0.44     77099
weighted avg       0.63      0.70      0.59     77099


The high precision and recall values for class 1 suggest that the model is good at predicting SEVERITYCODE=1 accidents. The low values for SEVERITYCODE=2, however, suggest that most collisions resulting in injuries will not be predicted as such.

In [416]:
fig = plt.figure(figsize=(20, 8)) # create figure
fig.suptitle("Decision Tree Model QC", fontsize='20')
ax1 = fig.add_subplot(121)
ax2 = fig.add_subplot(122)


#Qualitative assessment of the quality of the model by comparing the value distribution 
# of the real data with the predicted values.

ax1 = sns.distplot(Y, hist=True, color="r", label="Actual Value", ax=ax1)
ax1 = sns.distplot(Yhat_tree_test_best, color='b', hist=True, label="Fitted Values", ax=ax1)

ax1.set_ylabel("Observed data")
ax1.set_xlabel("SEVERITYCODE")
ax1.set_title('Distribution plot | Decision Tree: Best depth='+str(depth_tree_max))
ax1.legend()


ax2 = plt.plot(max_depths, train_results, 'k', label="Train AUC")
ax2 = plt.plot(max_depths, test_results, 'b', label="Test AUC")
ax2 = plt.plot(depth_tree_max,test_results[test_results_best_tree -1], 'X', color='k')
ax2 = plt.text(depth_tree_max-1,test_results[test_results_best_tree -1]+0.0005, ('maximum AUC score'), color='k')

plt.legend()
plt.ylabel('AUC score')
plt.xlabel('Tree depth')
plt.title('Area Under Curve (AUC) Score vs Decision Tree Depth | Best Depth for n='+str(depth_tree_max))
plt.show()

4.5.2.3 Logistic Regression

Logistic regression is the appropriate regression analysis to conduct when the dependent variable is dichotomous (binary). Like all regression analyses, the logistic regression is a predictive analysis. Logistic regression is used to describe data and to explain the relationship between one dependent binary variable and one or more nominal, ordinal, interval or ratio-level independent variables.

More information can be found here.
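At its core, the model maps a linear combination of the features to a probability via the logistic (sigmoid) function and applies a decision threshold. A minimal sketch with made-up weights and a hypothetical three-feature sample (the coefficients are illustrative, not fitted values):

```python
from math import exp

def sigmoid(z):
    """Logistic function: maps any real value to a probability in (0, 1)."""
    return 1.0 / (1.0 + exp(-z))

def predict_proba(features, weights, bias):
    """P(class = 2 | features) for one sample; weights/bias are illustrative."""
    z = sum(w * x for w, x in zip(weights, features)) + bias
    return sigmoid(z)

p = predict_proba([1, 0, 1], weights=[0.8, -0.4, 0.3], bias=-1.0)
prediction = 2 if p >= 0.5 else 1  # default 0.5 decision threshold
```

The `C` parameter swept in the cell below does not change this prediction rule; it controls how strongly the fitted weights are regularised (smaller C = stronger regularisation).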

In [427]:
C = np.linspace(1,10,10, endpoint=True) #Inverse of regularization strength


report_logreg = []
f1_weight_avg_logreg = []

train_results = []
test_results = []
Yhat_logreg_list = []


for n in C:
    
    n=int(n)

    clf = LogisticRegression(C=n, solver='liblinear',penalty='l2')

    clf = clf.fit(x_train,y_train)
    
    
    Yhat_logreg = clf.predict(x_train)
    
    false_positive_rate, true_positive_rate, thresholds = roc_curve(y_train-1, Yhat_logreg-1)
    roc_auc = auc(false_positive_rate, true_positive_rate)
    train_results.append(roc_auc)
    
    
    Yhat_logreg_test = clf.predict(x_test)
    Yhat_logreg_list.append(Yhat_logreg_test)
    
    false_positive_rate, true_positive_rate, thresholds = roc_curve(y_test-1, Yhat_logreg_test-1)
    roc_auc = auc(false_positive_rate, true_positive_rate)
    test_results.append(roc_auc)
    
    report_logreg.append(classification_report(y_test, Yhat_logreg_test))
    f1_weight_avg_logreg.append(f1_score(y_test, Yhat_logreg_test, average='weighted'))
    

test_results_best_logreg = np.argmax(test_results)+1 #iteration number (= list index + 1)
depth_logreg_max = C[test_results_best_logreg-1] # C value at the maximum test AUC
report_best_logreg = report_logreg[test_results_best_logreg-1]
Yhat_logreg_test_best = Yhat_logreg_list[test_results_best_logreg-1]

print(depth_logreg_max)
print(report_best_logreg)
3.0
              precision    recall  f1-score   support

           1       0.71      0.96      0.82     54069
           2       0.46      0.07      0.12     23030

    accuracy                           0.70     77099
   macro avg       0.58      0.52      0.47     77099
weighted avg       0.63      0.70      0.61     77099


Similar to the other models, the high precision and recall values for class 1 suggest that the model is good at predicting SEVERITYCODE=1 accidents. The low values for SEVERITYCODE=2, however, suggest that most collisions resulting in injuries will not be predicted as such.

In [433]:
fig = plt.figure(figsize=(20, 8)) # create figure
fig.suptitle("Logistic Regression Model QC", fontsize='20')
ax1 = fig.add_subplot(121)
ax2 = fig.add_subplot(122)


#Qualitative assessment of the quality of the model by comparing the value distribution 
# of the real data with the predicted values.

ax1 = sns.distplot(Y, hist=True, color="r", label="Actual Value", ax=ax1)
ax1 = sns.distplot(Yhat_logreg_test_best, color='b', hist=True, label="Fitted Values", ax=ax1)

ax1.set_ylabel("Observed data")
ax1.set_xlabel("SEVERITYCODE")
ax1.set_title('Distribution plot | Logistic Regression: Best inverse of regularization strength C='+str(depth_logreg_max))
ax1.legend()


ax2 = plt.plot(C, train_results, 'k', label="Train AUC")
ax2 = plt.plot(C, test_results, 'b', label="Test AUC")
ax2 = plt.plot(depth_logreg_max,test_results[test_results_best_logreg -1], 'X', color='k')
ax2 = plt.text(depth_logreg_max-1,test_results[test_results_best_logreg-1]+0.0005, ('maximum AUC score'), color='k')

plt.legend()
plt.ylabel('AUC score')
plt.xlabel('C')
plt.title('Area Under Curve (AUC) Score vs C | Best inverse of regularization strength C='+str(depth_logreg_max))
plt.show()

5. Modelling Results

This modelling exercise tried to predict traffic accident severity (separating property damage from physical damage) using supervised machine learning techniques on a dataset from Seattle, USA. Different modelling techniques were tested on the same cleaned-up dataset, containing 46 encoded features and 192747 entries before and 238461 entries after applying the Synthetic Minority Oversampling Technique (SMOTE) to the training dataset. Both pre-conditioning steps were required as the dataset was imbalanced and mostly contained categorical data. Due to the limited amount of data, 60% of the input data was used for training and 40% for testing. The dataset was acquired in March 2013.

Data Exploration

Initial exploratory data analysis revealed that 89.85% of accidents involving pedestrians and 87.53% of collisions involving cyclists lead to injuries (SEVERITYCODE = 2). This predictive power could not be increased by involving additional features in the model training. The active participation of cyclists and pedestrians in day-to-day traffic poses the highest risk of injuries. The risk of physical damage is further increased at junctions.

Machine Learning Models

Three different algorithms were tested in order to predict traffic accident severity in collisions: K-Nearest Neighbours (KNN), Decision Tree, and Logistic Regression.

a) K-Nearest Neighbours (KNN)

KNN models were tested on both the original and the augmented dataset for comparison. To assess the quality of the model, the weighted F1 score was used, together with the distribution plot to judge the number of misclassified samples.

          precision    recall  f1-score   support

       1       0.74      0.80      0.77     54069
       2       0.42      0.35      0.38     23030

accuracy                           0.66     77099
macro avg      0.58      0.57      0.57     77099
weighted avg   0.65      0.66      0.65     77099


With high precision and recall values of 0.74 and 0.80 for SEVERITYCLASS=1 and 0.42 and 0.35 for SEVERITYCLASS=2, respectively (K=8), this model performs best amongst the models tested. Using the SMOTE-augmented training dataset yields comparable results.

          precision    recall  f1-score   support

       1       0.73      0.83      0.78     54069
       2       0.42      0.29      0.34     23030

accuracy                           0.67     77099
macro avg      0.58      0.56      0.56     77099
weighted avg   0.64      0.67      0.65     77099


While the distribution plot shows a relatively good distribution of predicted SEVERITYCLASS values, the confusion matrix reveals a large number of false negatives for class 2 in both cases. Practically, this implies that accidents resulting in an injury will be misclassified as property damage only, which can prove fatal when resources are allocated incorrectly.

b) Decision Tree

In addition to the KNN classifier, decision trees were tested to reliably predict SEVERITYCLASS for accidents in Seattle. This evaluation was conducted on the non-augmented training data.

As the classification report below suggests, the model successfully predicts SEVERITYCLASS=1 incidents, however, the low recall value for SEVERITYCLASS=2 indicates a large number of false negatives and therefore a low F1 score for that class.

          precision    recall  f1-score   support

       1       0.70      0.98      0.82     54069
       2       0.45      0.03      0.06     23030

accuracy                           0.70     77099
macro avg      0.58      0.51      0.44     77099
weighted avg   0.63      0.70      0.59     77099


In addition to the weighted F1 score for model evaluation, the Area Under Curve (AUC) score was tested, which yields the best results for a tree depth of 17. The distribution plot indicates that the model overpredicts SEVERITYCLASS=1, but fails to reliably predict SEVERITYCLASS=2.

c) Logistic Regression

Last but not least, logistic regressors were tested to predict SEVERITYCLASS for accidents in Seattle from the labeled input dataset.

As the classification report below suggests, the model performs slightly better than the decision tree and manages to accurately predict SEVERITYCLASS=1 incidents, however, the low recall value for SEVERITYCLASS=2 indicates a large number of false negatives and therefore a relatively low F1 score for that class.

          precision    recall  f1-score   support

       1       0.71      0.96      0.82     54069
       2       0.46      0.07      0.12     23030

accuracy                           0.70     77099
macro avg      0.58      0.52      0.47     77099
weighted avg   0.63      0.70      0.61     77099

In addition to the weighted F1 score for model evaluation, the Area Under Curve (AUC) score was tested, which yields the best results for an inverse of regularization strength C=3. The model suffers from the same issues as the decision tree: it overpredicts SEVERITYCLASS=1, but fails to reliably predict SEVERITYCLASS=2.

6. Conclusions

This Capstone project served to provide insight into the lifecycle of a typical data science project and offered the possibility to apply the previously learned theory to a real-life example, by means of exploratory data analysis and by testing and benchmarking various machine learning models against each other, with the ultimate goal of predicting accident severity from metadata acquired by the Seattle Police Department (SPD).

In total, three different methods were tested against each other to classify SEVERITYCLASS, namely K-Nearest Neighbours (KNN), Decision Trees, and Logistic Regression.

While the exploratory data analysis suggests that almost 90% of all accidents involving pedestrians and 88% of collisions involving cyclists lead to injuries (compare: 28% of accidents without pedestrians/cyclists lead to injuries), the tested machine learning models generally had problems correctly predicting SEVERITYCLASS=2 and therefore exhibited a high number of false negatives for this class, which in real life can lead to a wrong allocation of police and first-responder resources and potentially prove fatal.

The best-performing method tested was KNN.

7. Modelling Limitations

  • The dataset covers only one month and is biased by time of day (60% of values recorded during the day vs 40% recorded at night). For a model that is applicable to the whole year (and all weather conditions), additional months need to be considered for training, with a potential split into weekdays/weekends etc.
  • Model accuracy could potentially be further improved by grouping dependent features (e.g. into good-moderate-bad driving conditions, influenced by LIGHTCOND, WEATHER, and ROADCOND) and thus reducing model bias.
  • Due to the fundamentally different traffic situations during day and night time / workdays and weekends, it could make sense to test the efficiency of several specialised models against one model covering all eventualities.
  • The model is currently biased by the average boundary conditions of a single month, with some feature attributes more represented than others. A further improvement of the model is expected from balancing them and using more training data.
  • Analysis of the relative distribution of accidents leading to property damage only (SEVERITYCODE=1) and accidents resulting in injuries (SEVERITYCODE=2) shows that, at 30%, the latter is underrepresented in the unbalanced input dataset. This was compensated for by generating additional data using SMOTE.
  • Individual model performances vs. number of input features could have been tested more thoroughly. In light of the course objective (i.e. demonstrating and applying the learned skills of the lectures), however, it was decided that this level of detail is appropriate.